Exploratory data analysis
for collagen diseases dataset
1 Data summary
In order to explore the mechanisms of collagen diseases, a dataset was extracted from a university hospital database. The overall objective is to study potential factors that help detect and predict thrombosis, one of the severe complications caused by collagen diseases. The dataset contains three parts: patients' information and diagnoses, examination results for patients suspected of thrombosis, and general laboratory examination results.
The examination for thrombosis is mainly done through blood tests. Thus, the dataset describing examination results of patients suspected of thrombosis contains anti-cardiolipin antibody measurements and coagulation measurements (coagulation is the process of blood changing to a solid or semi-solid state), along with the degree of thrombosis. The general laboratory examination results include routine blood tests, such as red blood cell count, blood glucose, and total bilirubin. These tests are not necessarily related to a thrombosis diagnosis and can happen anytime the doctors think the patient needs them. The three datasets are connected by patient ID, in one-to-many relations: one patient can have multiple tests on the same or different dates, and a patient can also have one or more diagnoses.
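The one-to-many relation can be sketched with two toy frames (the IDs and values below are made up, not from the dataset):

```python
import pandas as pd

# Toy frames mirroring the layout: one row per patient vs. many lab rows per patient
patients = pd.DataFrame({'ID': [1, 2], 'SEX': ['F', 'M']})
labs = pd.DataFrame({'ID': [1, 1, 2, 2, 2],
                     'WBC': [5200, 6100, 4800, 7300, 6900]})

# Each patient row fans out to all of that patient's lab rows
joined = patients.merge(labs, on='ID', how='left')
```

Here `joined` has five rows: patient 1's information is repeated for each of their two lab records, and patient 2's for each of their three.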
2 Initial questions
The initial questions are listed below.
- What symptoms cause a doctor to suspect that a patient has thrombosis?
- Are some of the measurements correlated to each other?
- Are some of the symptoms always occurring together?
3 Data munging
#| echo: false
#| warning: false
### this cell is used to run python in qmd so that one notebook can run both r and python
library(reticulate)
use_python("/Users/xiaojingni/miniforge3/envs/anly503/bin/python")
3.1 TSUMOTO_A
3.1.1 load patient data
Some feature meanings:
- Description: the date when the patient's data was entered into the database
- First date: the date when a patient came to the hospital
- Admission: patient was admitted to the hospital or followed at the outpatient clinic
- Diagnosis: some patients may suffer from several diseases
3.1.2 Missing values and duplicates
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1240 entries, 0 to 1239
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 1240 non-null int64
1 SEX 1227 non-null object
2 Birthday 1239 non-null object
3 Description 1023 non-null object
4 First Date 992 non-null object
5 Admission 1209 non-null object
6 Diagnosis 1240 non-null object
dtypes: int64(1), object(6)
memory usage: 67.9+ KB
There are some missing values, shown below. All records have an ID and a Diagnosis. Birthday has one missing value. There are 248 patients without the "First Date" information (the date when the patient came to the hospital).
ID 0
SEX 13
Birthday 1
Description 217
First Date 248
Admission 31
Diagnosis 0
dtype: int64
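A per-column count like the one above is typically produced with `isnull().sum()`; a minimal sketch on an invented toy frame:

```python
import pandas as pd

# toy frame with some missing entries
df = pd.DataFrame({'ID': [1, 2, 3],
                   'SEX': ['F', None, 'M'],
                   'Birthday': ['1934/2/13', '1937/5/2', None]})
missing = df.isnull().sum()  # number of missing values per column
```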
Check whether the ID column has duplicate values.
# check if ID has duplicate values
if len(patients_df_raw) != len(set(patients_df_raw.ID)):
    print("duplicates found in the list")
else:
    print("No duplicates found in the list")

duplicates found in the list
List the duplicate IDs and their records.
# below code is revised from https://www.trainingint.com/how-to-find-duplicates-in-a-python-list.html
id_set = set()
dupset = set()
for i in patients_df_raw.ID:
    if i not in id_set:
        id_set.add(i)
    else:
        # collect IDs seen more than once
        dupset.add(i)
# print the duplicate entries
print("List of duplicates\n",
      patients_df_raw.loc[patients_df_raw['ID'].isin(list(dupset))])

List of duplicates
ID SEX Birthday Description First Date Admission Diagnosis
264 2557319 F 1959/9/27 NaN NaN + SLE
265 2557319 F 1984/9/27 98.07.28 91.11.05 + SLE
1214 5807039 F 1972/11/8 98.06.22 98.06.08 - APS
1215 5807039 F 1972/11/8 98.07.28 98.06.08 - APS
For patient 2557319, only one record has full data, so the duplicate is probably a bad entry; deleting the incomplete record solves the problem. For patient 5807039, the first date is the same but the input (Description) date is not, so I keep the record with the earlier input date.
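The same duplicate scan can be done with pandas' built-in `duplicated`, avoiding the manual set bookkeeping; a sketch on toy IDs (the surrounding frame is invented):

```python
import pandas as pd

df = pd.DataFrame({'ID': [2557319, 2557319, 5807039, 5807039, 111],
                   'val': list('abcde')})
# keep=False marks every row of a duplicated ID, not just the repeats
dups = df[df['ID'].duplicated(keep=False)]
```

`dups` then contains all four rows of the two duplicated IDs, and the unwanted row of each pair can be dropped by its index label.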
3.1.3 Cleaning and data format
Some of the date entries contain typo letters; replacing the letters with "" solves this. Below I create three new columns holding reformatted versions of the date columns "Birthday", "Description", and "First Date".
# reformat
# Birthday
patients_df_raw['b-day'] = pd.to_datetime(patients_df_raw['Birthday'])
# Description day
patients_df_raw['d-day'] = pd.to_datetime(patients_df_raw['Description'])
# First Date, remove typo letters
patients_df_raw['f-day'] = pd.to_datetime(
    patients_df_raw['First Date'].replace(r"[A-Z]", "", regex=True))

3.1.4 New variables
In this section, the new columns age_of_first_come and Diagnosis1 through Diagnosis4 are created. age_of_first_come is based on the difference between the Description date and the birthday, and the Diagnosis columns are generated by parsing the Diagnosis column.
patients_df_raw['age_of_first_come'] = patients_df_raw.apply(
    lambda row: relativedelta.relativedelta(row['d-day'], row['b-day']).years, axis=1)
# sanity check
patients_df_raw.head(5)

| | ID | SEX | Birthday | Description | First Date | Admission | Diagnosis | b-day | d-day | f-day | age_of_first_come |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2110 | F | 1934/2/13 | 94.02.14 | 93.02.10 | + | RA susp. | 1934-02-13 | 1994-02-14 | 1993-02-10 | 60 |
| 1 | 11408 | F | 1937/5/2 | 96.12.01 | 73.01.01 | + | PSS | 1937-05-02 | 1996-12-01 | 1973-01-01 | 59 |
| 6 | 43003 | M | 1937/11/24 | 94.03.08 | 94.03.08 | - | Raynaud's phenomenon | 1937-11-24 | 1994-03-08 | 1994-03-08 | 56 |
| 9 | 57266 | M | 1923/7/25 | 97.02.03 | 93.02.05 | + | RA | 1923-07-25 | 1997-02-03 | 1993-02-05 | 73 |
| 11 | 71417 | F | 1929/10/20 | 96.12.03 | 95.06.20 | - | RA | 1929-10-20 | 1996-12-03 | 1995-06-20 | 67 |
# parsing diagnosis
diagnosis_terms = patients_df_raw['Diagnosis'].str.split(", ")
d = collections.Counter()
for phrases in diagnosis_terms:
    for phrase in phrases:
        d[phrase] += 1
sorted(d.items(), key=lambda x: x[1], reverse=True)

[('SJS', 220),
('SLE', 186),
('RA', 157),
('BEHCET', 62),
('PSS', 55),
('MCTD', 37),
('SJS susp', 36),
('APS', 35),
('PM', 30),
('RA susp', 22),
('SLE susp', 21),
('FUO', 15),
('DM', 14),
('PN', 10),
('MRA', 9),
('AORTITIS', 8),
('Raynaud', 7),
('DM (amyopathic)', 7),
('RA (seronegative)', 6),
('APS susp', 5),
('Adult STILL', 5),
('collagen susp', 5),
('PSS susp', 4),
('BEHCET susp', 4),
('Scleroderma', 4),
('PNc', 3),
('PMR susp', 3),
('PMR', 3),
('WG', 3),
('relapsingü@polychondritis', 2),
('AIHA', 2),
('RA,SJS', 2),
('PM susp', 2),
('relapsing polychondritis', 2),
('JRA', 2),
('UC', 2),
('AGA', 2),
('ITP', 2),
('Sarcoidosis', 2),
('PM/DM', 2),
('Sarcoidosis susp', 2),
('MCTD susp', 2),
('RA susp.', 1),
("Raynaud's phenomenon", 1),
('BEHCET (òsæSî^)', 1),
('PN (vasculitis)', 1),
('Sweet', 1),
('PSS(CREST)', 1),
('Hyper CPK', 1),
('BEHCET,neuro', 1),
('BEHCET-vasculo', 1),
('RA\x1dSJS\x1dPM', 1),
('Adie', 1),
('Vasculitis susp', 1),
('Vasculitis', 1),
('DM susp', 1),
('BEHCETüineuroüj', 1),
('PBC', 1),
('Psoriatic Arthritis', 1),
("Reiter's syndrome susp", 1),
('Chronic EBV', 1),
('BEHCET (vasculo)', 1),
('BEHCET(entero)', 1),
('drug induced hepatitis', 1),
('ANAPHYLACTOID PURPURA NEPHRITIS', 1),
('McArdle', 1),
('Psoriatic arthritis', 1),
('ASO ', 1),
('APS\x1dAPS', 1),
('PN susp', 1),
('PSS,SJS', 1),
('AORTITIS susp', 1),
('Myasthenia G', 1),
('pemphygus', 1),
('IP', 1),
('ìéCPKîîÅÌ', 1),
('PSS(sclerodermatomyositis)', 1),
('CREST', 1),
('ANA', 1),
('SJS\x1dMCTD', 1),
('lupoid hepatitis', 1),
('collagen disease susp', 1),
('RA(seronegative)', 1),
('Angioedema with eosinophilia', 1),
('arthralgia', 1),
('DLE', 1),
('FUO (STILL susp.)', 1),
('BEHCET (entero)', 1),
('CREST synd susp', 1),
('generalized morphea', 1),
('PNc susp', 1),
('ANCA', 1),
('EN', 1),
('Chrnoic Palvo V', 1),
('WG\x1dSJS', 1),
('SJS susp\x0b\x1dSJS', 1),
('Chronic EB virus infection susp', 1),
('PSS susp\x1dPSS', 1),
('Chronic EB', 1),
('Henoch-Schoelein purpura', 1),
('Morphia', 1),
('Focal Myositis', 1),
('Vasculitis synd. susp', 1),
('Weber-Christian', 1)]
diag = patients_df_raw['Diagnosis'].str.split(", ", expand=True)
diag.columns = ["Diagnosis"+str(i) for i in range(1, 5)]
patients_df = pd.concat([patients_df_raw, diag], axis=1)
patients_df.head(5)

| | ID | SEX | Birthday | Description | First Date | Admission | Diagnosis | b-day | d-day | f-day | age_of_first_come | Diagnosis1 | Diagnosis2 | Diagnosis3 | Diagnosis4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2110 | F | 1934/2/13 | 94.02.14 | 93.02.10 | + | RA susp. | 1934-02-13 | 1994-02-14 | 1993-02-10 | 60 | RA susp. | None | None | None |
| 1 | 11408 | F | 1937/5/2 | 96.12.01 | 73.01.01 | + | PSS | 1937-05-02 | 1996-12-01 | 1973-01-01 | 59 | PSS | None | None | None |
| 6 | 43003 | M | 1937/11/24 | 94.03.08 | 94.03.08 | - | Raynaud's phenomenon | 1937-11-24 | 1994-03-08 | 1994-03-08 | 56 | Raynaud's phenomenon | None | None | None |
| 9 | 57266 | M | 1923/7/25 | 97.02.03 | 93.02.05 | + | RA | 1923-07-25 | 1997-02-03 | 1993-02-05 | 73 | RA | None | None | None |
| 11 | 71417 | F | 1929/10/20 | 96.12.03 | 95.06.20 | - | RA | 1929-10-20 | 1996-12-03 | 1995-06-20 | 67 | RA | None | None | None |
3.2 TSUMOTO_B
3.2.1 load data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 806 entries, 0 to 805
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 770 non-null float64
1 Examination Date 796 non-null object
2 aCL IgG 806 non-null float64
3 aCL IgM 806 non-null float64
4 ANA 768 non-null object
5 ANA Pattern 545 non-null object
6 aCL IgA 806 non-null float64
7 Diagnosis 475 non-null object
8 KCT 146 non-null object
9 RVVT 146 non-null object
10 LAC 222 non-null object
11 Symptoms 80 non-null object
12 Thrombosis 806 non-null int64
dtypes: float64(4), int64(1), object(8)
memory usage: 82.0+ KB
# values of different columns
for nm in ["KCT", "RVVT", "LAC", "ANA", "ANA Pattern"]:
    print(slab_df_raw[nm].value_counts())
    print("-------------------")

- 105
+ 41
Name: KCT, dtype: int64
-------------------
- 99
+ 47
Name: RVVT, dtype: int64
-------------------
- 147
+ 75
Name: LAC, dtype: int64
-------------------
0 206
16 155
64 112
4 108
256 82
1024 58
4096 45
4094 1
>4096 1
Name: ANA, dtype: int64
-------------------
S 252
P 178
P,S 89
S,P 11
D,P,S 2
S,D 2
D,P 2
p 2
P.S 1
P,D 1
S,N 1
D,S 1
D 1
N 1
P.D 1
Name: ANA Pattern, dtype: int64
-------------------
3.2.2 Missing values and duplicates
The dataset contains missing values in some of the columns, so different strategies will be used for different purposes. This dataset allows duplicated IDs (one patient can have several special examinations), so here I only check for fully duplicated entries as a data-quality check.
Joining with other datasets requires the ID column. Thus, for the join analysis, records without ID information will be removed.
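Removing the ID-less records can be done with `dropna(subset=...)`; a minimal sketch on toy values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [14872.0, np.nan, 48473.0],
                   'aCL IgG': [1.3, 3.3, 4.3]})
# keep only rows where ID is present
with_id = df.dropna(subset=['ID'])
```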
| | ID | Examination Date | aCL IgG | aCL IgM | ANA | ANA Pattern | aCL IgA | Diagnosis | KCT | RVVT | LAC | Symptoms | Thrombosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14872.0 | 1997/5/27 | 1.3 | 1.6 | 256 | P | 0.0 | MCTD, AMI | NaN | NaN | - | AMI | 1 |
| 1 | 48473.0 | 1992/12/21 | 4.3 | 4.6 | 256 | P,S | 3.3 | SLE | - | - | - | NaN | 0 |
| 2 | 102490.0 | 1995/4/20 | 2.3 | 2.5 | 0 | NaN | 3.5 | PSS | NaN | NaN | NaN | NaN | 0 |
| 3 | 108788.0 | 1997/5/6 | 0.0 | 0.0 | 16 | S | 0.0 | NaN | NaN | NaN | - | NaN | 0 |
| 4 | 122405.0 | 1998/4/2 | 0.0 | 4.0 | 4 | P | 0.0 | SLE, SjS, vertigo | NaN | NaN | NaN | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 801 | NaN | 1996/11/21 | 3.3 | 1.7 | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | 0 |
| 802 | NaN | 1995/10/2 | 1.2 | 3.0 | 4 | NaN | 1.8 | NaN | NaN | NaN | NaN | NaN | 0 |
| 803 | NaN | 1997/7/14 | 0.0 | 2.2 | NaN | NaN | 0.0 | PSS | NaN | NaN | NaN | NaN | 0 |
| 804 | NaN | NaN | 0.0 | 19.8 | 4096 | S | 0.0 | SLE+PH | NaN | NaN | NaN | PH | 1 |
| 805 | NaN | 1998/1/16 | 0.0 | 2.0 | 16 | P | 0.0 | NaN | NaN | NaN | NaN | NaN | 0 |
806 rows × 13 columns
<bound method DataFrame.info of ID Examination Date aCL IgG aCL IgM ANA ANA Pattern aCL IgA \
0 14872.0 1997/5/27 1.3 1.6 256 P 0.0
1 48473.0 1992/12/21 4.3 4.6 256 P,S 3.3
2 102490.0 1995/4/20 2.3 2.5 0 NaN 3.5
3 108788.0 1997/5/6 0.0 0.0 16 S 0.0
4 122405.0 1998/4/2 0.0 4.0 4 P 0.0
.. ... ... ... ... ... ... ...
765 5771963.0 1998/3/13 0.0 1.5 64 P 0.0
766 5775402.0 1998/3/23 0.0 15.6 256 P,S 7.0
767 5775806.0 1998/3/24 0.0 0.9 16 P 0.0
768 5779550.0 1998/3/31 0.0 2.5 4 S 0.0
769 9334041.0 1995/6/8 3.0 2.4 0 NaN 3.9
Diagnosis KCT RVVT LAC Symptoms Thrombosis
0 MCTD, AMI NaN NaN - AMI 1
1 SLE - - - NaN 0
2 PSS NaN NaN NaN NaN 0
3 NaN NaN NaN - NaN 0
4 SLE, SjS, vertigo NaN NaN NaN NaN 0
.. ... ... ... ... ... ...
765 MCTD NaN NaN NaN NaN 0
766 PMR NaN NaN NaN NaN 0
767 NaN NaN NaN NaN NaN 0
768 SLE susp NaN NaN NaN NaN 0
769 NaN NaN NaN NaN NaN 0
[770 rows x 13 columns]>
3.2.3 Cleaning and data format
3.2.3.1 Parsing Diagnosis column
The Diagnosis column contains one or more diagnoses. This section parses the column into keywords for further analysis. For the thrombosis-positive patients, the text is first split on special characters (including spaces), and then the top-10 most frequent words are computed. The same procedure is repeated for the thrombosis-negative patients.
# Thrombosis positive words
positive_words = [re.split(r'[^a-zA-Z0-9/s]+', text) for text, label in
                  zip(slab_df_raw['Diagnosis'], slab_df_raw['Thrombosis']) if label >= 1 and type(text) != float]
positive_dict = collections.Counter()
for phrases in positive_words:
    positive_dict += collections.Counter(phrases)
# clean up keywords
if "" in positive_dict:
    positive_dict.pop("")
positive_top10 = sorted(positive_dict.items(),
                        key=lambda x: x[1], reverse=True)[:10]
positive_top10

[('SLE', 44),
('APS', 17),
('SjS', 11),
('RA', 7),
('infarction', 6),
('MCTD', 5),
('brain', 5),
('CNS', 3),
('AMI', 2),
('abortion', 2)]
# Thrombosis negative words
negative_words = [re.split(r'[^a-zA-Z0-9/s]+', text) for text, label in
                  zip(slab_df_raw['Diagnosis'], slab_df_raw['Thrombosis']) if label == 0 and type(text) != float]
negative_dict = collections.Counter()
for phrases in negative_words:
    negative_dict += collections.Counter(phrases)
# clean up keywords
if "" in negative_dict:
    negative_dict.pop("")
negative_top10 = sorted(negative_dict.items(),
                        key=lambda x: x[1], reverse=True)[:10]
negative_top10

[('SLE', 142),
('SjS', 105),
('RA', 52),
('susp', 42),
('MCTD', 23),
('PSS', 21),
('APS', 16),
('PM', 16),
('IP', 13),
('Behcet', 11)]
3.2.3.2 Data format
Here, I change the examination date to datetime format.
/var/folders/qn/dr1_1v717pjdgfssc9z4gdyw0000gp/T/ipykernel_76472/3742017373.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
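The warning above comes from assigning into a slice of a DataFrame; one way to avoid it is to take an explicit `.copy()` before assigning. A sketch (toy frame; the real column name `Examination Date` is kept):

```python
import pandas as pd

raw = pd.DataFrame({'ID': [14872, 48473],
                    'Examination Date': ['1997/5/27', '1992/12/21']})
# slicing then assigning triggers SettingWithCopyWarning;
# an explicit copy makes the new frame independent of the original
slab_df = raw[raw['ID'].notna()].copy()
slab_df['Examination Date'] = pd.to_datetime(slab_df['Examination Date'])
```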
3.3 TSUMOTO_C
3.3.1 load data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57510 entries, 0 to 57509
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 57510 non-null int64
1 Date 57510 non-null int64
2 GOT 46169 non-null object
3 GPT 46154 non-null object
4 LDH 46308 non-null object
5 ALP 45521 non-null object
6 TP 45636 non-null object
7 ALB 45321 non-null object
8 UA 45308 non-null object
9 UN 46123 non-null float64
10 CRE 46219 non-null object
11 T-BIL 38777 non-null object
12 T-CHO 42420 non-null object
13 TG 29258 non-null object
14 CPK 20564 non-null object
15 GLU 8432 non-null object
16 WBC 49605 non-null object
17 RBC 49605 non-null object
18 HGB 49606 non-null object
19 HCT 49606 non-null object
20 PLT 47905 non-null object
21 PT 4003 non-null float64
22 APTT 892 non-null float64
23 FG 3383 non-null object
24 PIC 799 non-null object
25 TAT 1363 non-null object
26 TAT2 992 non-null object
27 U-PRO 39108 non-null object
28 IGG 12329 non-null object
29 IGA 12326 non-null object
30 IGM 12321 non-null object
31 CRP 47101 non-null object
32 RA 8560 non-null object
33 RF 12567 non-null object
34 C3 26863 non-null object
35 C4 26864 non-null object
36 RNP 742 non-null object
37 SM 655 non-null object
38 SC170 138 non-null object
39 SSA 575 non-null object
40 SSB 544 non-null object
41 CENTROMEA 67 non-null object
42 DNA 432 non-null float64
43 DNA-II 0 non-null float64
dtypes: float64(5), int64(2), object(37)
memory usage: 19.3+ MB
3.3.2 Missing values and duplicates
The dataset contains missing values in some of the columns. All records have the ID feature, so joining should not be a problem. Missing counts for the other columns are shown below; the missing values will be handled with different strategies in the EDA section.
| | ID | Date | GOT | GPT | LDH | ALP | TP | ALB | UA | UN | ... | C3 | C4 | RNP | SM | SC170 | SSA | SSB | CENTROMEA | DNA | DNA-II |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2110 | 860419 | 24 | 12 | 152 | 63 | 7.5 | 4.5 | 3.4 | 16.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2110 | 860430 | 25 | 12 | 162 | 76 | 7.9 | 4.6 | 4.7 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2110 | 860502 | 22 | 8 | 144 | 68 | 7 | 4.2 | 5 | 18.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2110 | 860506 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2110 | 860507 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57505 | 5834596 | 980730 | 34 | 61 | 199 | 213.0 | 7.9 | 3.5 | 4.9 | 10.9 | ... | NaN | NaN | negative | negative | NaN | 64 | negative | NaN | NaN | NaN |
| 57506 | 5834596 | 980820 | 34 | 36 | 177 | 278.0 | 9 | 4.1 | 4.1 | 17.9 | ... | 142 | 18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 57507 | 5834596 | 981015 | 27 | 20 | 153 | 181.0 | 8.4 | 3.8 | 4.6 | 14.7 | ... | 133 | 14 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 57508 | 5834596 | 981217 | 26 | 19 | 143 | 171.0 | 8.5 | 3.9 | 4.2 | 13.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 57509 | 5845877 | 980828 | 32 | 21 | 300 | 305.0 | 7.9 | 3.5 | 4.2 | 17.5 | ... | 116 | 19 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
57510 rows × 44 columns
ID 0
Date 0
GOT 11341
GPT 11356
LDH 11202
ALP 11989
TP 11874
ALB 12189
UA 12202
UN 11387
CRE 11291
T-BIL 18733
T-CHO 15090
TG 28252
CPK 36946
GLU 49078
WBC 7905
RBC 7905
HGB 7904
HCT 7904
PLT 9605
PT 53507
APTT 56618
FG 54127
PIC 56711
TAT 56147
TAT2 56518
U-PRO 18402
IGG 45181
IGA 45184
IGM 45189
CRP 10409
RA 48950
RF 44943
C3 30647
C4 30646
RNP 56768
SM 56855
SC170 57372
SSA 56935
SSB 56966
CENTROMEA 57443
DNA 57078
DNA-II 57510
dtype: int64
3.3.3 Data format
Here, I change the date to datetime format.
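Since `Date` is stored as a yymmdd integer (e.g. 860419), it needs an explicit format string; plain `pd.to_datetime` on the raw integers misreads them as nanosecond offsets. A sketch of one way to parse it (note that `%y` maps two-digit years to 1969–2068):

```python
import pandas as pd

dates = pd.Series([860419, 980828])
# convert to string first, then parse with an explicit yymmdd format
parsed = pd.to_datetime(dates.astype(str), format='%y%m%d')
```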
# measurement columns to numeric: entries without numeric values become NaN
num_cols = ['GOT', 'GPT', 'LDH', 'ALP', 'TP', 'ALB', 'UA', 'UN', 'CRE',
            'T-BIL', 'T-CHO', 'TG', 'CPK', 'GLU', 'WBC', 'RBC', 'HGB', 'HCT',
            'PLT', 'PT', 'APTT', 'FG', 'PIC', 'TAT', 'U-PRO', 'IGG', 'IGA',
            'IGM', 'RF', 'C3', 'C4', 'DNA', 'DNA-II']
for col in num_cols:
    clab_df_raw[col] = pd.to_numeric(clab_df_raw[col], errors='coerce')

3.3.4 New columns
For lab tests, there are criteria indicating whether an indicator is out of the normal range. In this section, new variables are introduced based on those criteria. Note: criteria that differ by gender will be dealt with later.
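The cell that follows spells each flag out by hand; the same logic can be driven from a table of (low, high) bounds. A sketch using two of the bounds from the cell's comments (the helper name and structure are illustrative, not part of the original code):

```python
import numpy as np
import pandas as pd

# (low, high) bounds; None means that side is unbounded
ranges = {'GOT': (None, 60), 'TP': (6, 8.5)}

def flag_normal(df, col, lo, hi):
    s = df[col]
    ok = pd.Series(True, index=df.index)
    if lo is not None:
        ok &= s > lo
    if hi is not None:
        ok &= s < hi
    # keep the string 'NaN' for missing measurements, as the original cell does
    return np.where(s.isna(), 'NaN', np.where(ok, 'yes', 'no'))

df = pd.DataFrame({'GOT': [24, 75, np.nan], 'TP': [7.5, 9.0, 5.0]})
for col, (lo, hi) in ranges.items():
    df['is_%s_normal' % col] = flag_normal(df, col, lo, hi)
```

Adding a new criterion then only requires adding one entry to `ranges`.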
clab_df = clab_df_raw.copy()
clab_df = clab_df.assign(
    is_GOT_normal=np.where(clab_df['GOT'] >= 60, 'no', np.where(
        clab_df['GOT'] < 60, 'yes', 'NaN')),  # is_GOT_normal N<60
    is_GPT_normal=np.where(clab_df['GPT'] >= 60, 'no', np.where(
        clab_df['GPT'] < 60, 'yes', 'NaN')),  # is_GPT_normal N<60
    is_LDH_normal=np.where(clab_df['LDH'] >= 500, 'no', np.where(
        clab_df['LDH'] < 500, 'yes', 'NaN')),  # is_LDH_normal N<500
    is_ALP_normal=np.where(clab_df['ALP'] >= 300, 'no', np.where(
        clab_df['ALP'] < 300, 'yes', 'NaN')),  # is_ALP_normal N<300
    is_TP_normal=np.where(((clab_df['TP'] >= 8.5) | (clab_df['TP'] <= 6)), 'no', np.where(
        ((clab_df['TP'] < 8.5) & (clab_df['TP'] > 6)), 'yes', 'NaN')),  # is_TP_normal 6<N<8.5
    is_ALB_normal=np.where(((clab_df['ALB'] >= 5.5) | (clab_df['ALB'] <= 3.5)), 'no', np.where(
        ((clab_df['ALB'] < 5.5) & (clab_df['ALB'] > 3.5)), 'yes', 'NaN')),  # is_ALB_normal 3.5<N<5.5
    is_UN_normal=np.where(clab_df['UN'] >= 30, 'no', np.where(
        clab_df['UN'] < 30, 'yes', 'NaN')),  # is_UN_normal N<30
    is_CRE_normal=np.where(clab_df['CRE'] >= 1.5, 'no', np.where(
        clab_df['CRE'] < 1.5, 'yes', 'NaN')),  # is_CRE_normal N<1.5
    is_TBIL_normal=np.where(clab_df['T-BIL'] >= 2, 'no', np.where(
        clab_df['T-BIL'] < 2, 'yes', 'NaN')),  # is_T-BIL_normal N<2
    is_TCHO_normal=np.where(clab_df['T-CHO'] >= 250, 'no', np.where(
        clab_df['T-CHO'] < 250, 'yes', 'NaN')),  # is_T-CHO_normal N<250
    is_TG_normal=np.where(clab_df['TG'] >= 200, 'no', np.where(
        clab_df['TG'] < 200, 'yes', 'NaN')),  # is_TG_normal N<200
    is_CPK_normal=np.where(clab_df['CPK'] >= 250, 'no', np.where(
        clab_df['CPK'] < 250, 'yes', 'NaN')),  # is_CPK_normal N<250
    is_GLU_normal=np.where(clab_df['GLU'] >= 180, 'no', np.where(
        clab_df['GLU'] < 180, 'yes', 'NaN')),  # is_GLU_normal N<180
    is_WBC_normal=np.where(((clab_df['WBC'] >= 9000) | (clab_df['WBC'] <= 3500)), 'no', np.where(
        ((clab_df['WBC'] < 9000) & (clab_df['WBC'] > 3500)), 'yes', 'NaN')),  # is_WBC_normal 3500<N<9000
    is_RBC_normal=np.where(((clab_df['RBC'] >= 600) | (clab_df['RBC'] <= 350)), 'no', np.where(
        ((clab_df['RBC'] < 600) & (clab_df['RBC'] > 350)), 'yes', 'NaN')),  # is_RBC_normal 350<N<600
    is_HGB_normal=np.where(((clab_df['HGB'] >= 17) | (clab_df['HGB'] <= 10)), 'no', np.where(
        ((clab_df['HGB'] < 17) & (clab_df['HGB'] > 10)), 'yes', 'NaN')),  # is_HGB_normal 10<N<17
    is_HCT_normal=np.where(((clab_df['HCT'] >= 52) | (clab_df['HCT'] <= 29)), 'no', np.where(
        ((clab_df['HCT'] < 52) & (clab_df['HCT'] > 29)), 'yes', 'NaN')),  # is_HCT_normal 29<N<52
    is_PLT_normal=np.where(((clab_df['PLT'] >= 400) | (clab_df['PLT'] <= 100)), 'no', np.where(
        ((clab_df['PLT'] < 400) & (clab_df['PLT'] > 100)), 'yes', 'NaN')),  # is_PLT_normal 100<N<400
    is_PT_normal=np.where(clab_df['PT'] >= 14, 'no', np.where(
        clab_df['PT'] < 14, 'yes', 'NaN')),  # is_PT_normal N<14
    is_APTT_normal=np.where(clab_df['APTT'] >= 45, 'no', np.where(
        clab_df['APTT'] < 45, 'yes', 'NaN')),  # is_APTT_normal N<45
    is_FG_normal=np.where(((clab_df['FG'] >= 450) | (clab_df['FG'] <= 150)), 'no', np.where(
        ((clab_df['FG'] < 450) & (clab_df['FG'] > 150)), 'yes', 'NaN')),  # is_FG_normal 150<N<450
    is_PIC_normal=np.where(clab_df['PIC'] >= 0.8, 'no', np.where(
        clab_df['PIC'] < 0.8, 'yes', 'NaN')),  # is_PIC_normal N<0.8
    is_TAT_normal=np.where(clab_df['TAT'] >= 3, 'no', np.where(
        clab_df['TAT'] < 3, 'yes', 'NaN')),  # is_TAT_normal N<3
    is_UPRO_normal=np.where(((clab_df['U-PRO'] >= 30) | (clab_df['U-PRO'] <= 0)), 'no', np.where(
        ((clab_df['U-PRO'] < 30) & (clab_df['U-PRO'] > 0)), 'yes', 'NaN')),  # is_U-PRO_normal 0<N<30
    is_IGG_normal=np.where(((clab_df['IGG'] >= 2000) | (clab_df['IGG'] <= 900)), 'no', np.where(
        ((clab_df['IGG'] < 2000) & (clab_df['IGG'] > 900)), 'yes', 'NaN')),  # is_IGG_normal 900<N<2000
    is_IGA_normal=np.where(((clab_df['IGA'] >= 500) | (clab_df['IGA'] <= 80)), 'no', np.where(
        ((clab_df['IGA'] < 500) & (clab_df['IGA'] > 80)), 'yes', 'NaN')),  # is_IGA_normal 80<N<500
    is_IGM_normal=np.where(((clab_df['IGM'] >= 400) | (clab_df['IGM'] <= 40)), 'no', np.where(
        ((clab_df['IGM'] < 400) & (clab_df['IGM'] > 40)), 'yes', 'NaN')),  # is_IGM_normal 40<N<400
    is_RF_normal=np.where(clab_df['RF'] >= 20, 'no', np.where(
        clab_df['RF'] < 20, 'yes', 'NaN')),  # is_RF_normal N<20
    is_C3_normal=np.where(clab_df['C3'] <= 35, 'no', np.where(
        clab_df['C3'] > 35, 'yes', 'NaN')),  # is_C3_normal N>35
    is_C4_normal=np.where(clab_df['C4'] <= 10, 'no', np.where(
        clab_df['C4'] > 10, 'yes', 'NaN')),  # is_C4_normal N>10
    is_DNA_normal=np.where(clab_df['DNA'] >= 8, 'no', np.where(
        clab_df['DNA'] < 8, 'yes', 'NaN')),  # is_DNA_normal N<8
    is_DNAII_normal=np.where(clab_df['DNA-II'] >= 8, 'no', np.where(
        clab_df['DNA-II'] < 8, 'yes', 'NaN')),  # is_DNAII_normal N<8
)

3.4 Join data
For analysis using patient information, the three datasets are joined to see if there are any patterns.
# cleaned A data
patients_df = pd.read_csv("../data/tsumotoa_clean.csv")
patients_df['b-day'] = pd.to_datetime(patients_df['b-day'])
patients_df['d-day'] = pd.to_datetime(patients_df['d-day'])
patients_df['f-day'] = pd.to_datetime(patients_df['f-day'])
# cleaned B data
slab_df = pd.read_csv("../data/tsumotob_clean.csv")
slab_df['Examination Date'] = pd.to_datetime(slab_df['Examination Date'])
# left join of C and A data
join_df = clab_df.merge(patients_df, on='ID', how='left')
# outer join with B data
join_df = join_df.merge(slab_df, on='ID', how='outer')
# date formatting
join_df['b-day'] = pd.to_datetime(join_df['b-day'])
join_df['d-day'] = pd.to_datetime(join_df['d-day'])
join_df['f-day'] = pd.to_datetime(join_df['f-day'])
join_df['Examination Date'] = pd.to_datetime(join_df['Examination Date'])

3.4.1 Missing values
There are missing values in the data; some of them are caused by merging the datasets. For example, some patients only did the special lab tests but not the regular ones, or the other way around. For now, none of them are removed; the missing values will be handled in the EDA section with strategies appropriate to each feature.
ID 0
Date 353
GOT 11732
GPT 11713
LDH 11594
...
KCT 49373
RVVT 49373
LAC 46353
Symptoms 52459
Thrombosis 37583
Length: 98, dtype: int64
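Whether a missing value was introduced by the merge (i.e., a patient appears in only one source) can be checked with `merge(..., indicator=True)`; a sketch with toy IDs:

```python
import pandas as pd

clab = pd.DataFrame({'ID': [1, 2], 'GOT': [24, 30]})
slab = pd.DataFrame({'ID': [2, 3], 'aCL IgG': [1.3, 4.3]})
# _merge labels each row 'left_only', 'right_only', or 'both'
m = clab.merge(slab, on='ID', how='outer', indicator=True)
```

Rows labeled anything other than `'both'` will carry NaN in the columns coming from the missing side.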
3.4.2 New columns
age_of_sexam is the patient's age at the time of the special examination.
join_df_2['age_of_sexam'] = join_df_2.apply(
    lambda row: relativedelta.relativedelta(row['Examination Date'], row['b-day']).years, axis=1)
# sanity check
join_df_2.head(5)

/var/folders/qn/dr1_1v717pjdgfssc9z4gdyw0000gp/T/ipykernel_76472/907839400.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| | ID | Date | GOT | GPT | LDH | ALP | TP | ALB | UA | UN | ... | ANA | ANA Pattern | aCL IgA | Diagnosis | KCT | RVVT | LAC | Symptoms | Thrombosis | age_of_sexam |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1421 | 133382.0 | 1970-01-01 00:00:00.000920910 | 17.0 | 16.0 | 256.0 | 148.0 | 6.7 | 3.6 | 5.8 | 10.0 | ... | 1024 | S | 0.0 | SLE | - | - | - | NaN | 0.0 | 62 |
| 1422 | 133382.0 | 1970-01-01 00:00:00.000920924 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1024 | S | 0.0 | SLE | - | - | - | NaN | 0.0 | 62 |
| 1423 | 133382.0 | 1970-01-01 00:00:00.000921013 | 16.0 | 14.0 | 237.0 | 148.0 | 6.7 | 3.6 | 5.4 | 12.0 | ... | 1024 | S | 0.0 | SLE | - | - | - | NaN | 0.0 | 62 |
| 1424 | 133382.0 | 1970-01-01 00:00:00.000921019 | 17.0 | 14.0 | 253.0 | 148.0 | 6.7 | 3.6 | 6.1 | 13.0 | ... | 1024 | S | 0.0 | SLE | - | - | - | NaN | 0.0 | 62 |
| 1425 | 133382.0 | 1970-01-01 00:00:00.000921020 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1024 | S | 0.0 | SLE | - | - | - | NaN | 0.0 | 62 |
5 rows × 99 columns
Whether UA is normal depends on gender: it is abnormal when N > 8.0 (male) or N > 6.5 (female).
join_df = join_df.assign(is_UA_normal=np.where(
    ((join_df['UA'] >= 8) & (join_df['SEX'] == 'M')), 'no', np.where(
        ((join_df['UA'] < 8) & (join_df['SEX'] == 'M')), 'yes', np.where(
            ((join_df['UA'] >= 6.5) & (join_df['SEX'] == 'F')), 'no', np.where(
                ((join_df['UA'] < 6.5) & (join_df['SEX'] == 'F')), 'yes', 'NaN')))))  # abnormal: N > 8.0 (Male), N > 6.5 (Female)

4 Exploratory analysis
4.1 Individual datasets
4.1.1 Summary statistics of patient information
# Defining colors for the pie chart
colors = ['pink', 'steelblue']
# Define the gap ratio of each fragment in a tuple
explode = (0.05, 0.05)
# plot pie chart grouped by sex
fig = plt.figure(1)
piechart = patients_df['SEX'].value_counts().plot(kind='pie',
                                                  autopct='%1.0f%%',
                                                  colors=colors,
                                                  explode=explode,
                                                  labels=["Female", "Male"],
                                                  fontsize=12)
plt.legend(bbox_to_anchor=(-0.05, -0.1), loc='lower left', fontsize=12)
plt.suptitle('Patient gender distribution', fontsize=16)
fig.show()

Female is the dominant gender among the patients (Figure 1).
fig = plt.figure(2)
plt.scatter(patients_df['d-day'], patients_df['f-day'])
plt.ylim([pd.Timestamp('1970-01-01'), pd.Timestamp('2000-12-31')])
plt.title("description date VS first date")
plt.xlabel("description date")
plt.ylabel("first date")
fig.show()
# first date is always earlier than description date
# so we can't use one date to replace the other

The first date is always earlier than the description date, so we can't use one date to replace the other (Figure 2). Some years have little or no data (e.g., 1995).

Special lab information

Thrombosis level vs. examination date (Figure 3). A large portion of the cases happened after 1992, and 1998 is the year with the greatest number of cases.
Plot keywords and frequency (Figure 4).
# join keywords
key_set = {k for k, _ in positive_top10} | {k for k, _ in negative_top10}
# order by Thrombosis positive decreasing
key_ls = sorted(key_set, key=lambda x: (positive_dict[x], negative_dict[x]))
# plot keywords RELATIVE frequency
nb_positive, nb_negative = sum(
slab_df['Thrombosis'] > 0), sum(slab_df['Thrombosis'] == 0)
fig = plt.figure(4)
hN = plt.barh(key_ls, [negative_dict[k] /
nb_negative for k in key_ls], label='negative', color='g')
hS = plt.barh(key_ls, [-positive_dict[k] /
nb_positive for k in key_ls], label='positive')
plt.xlim([-1, 1])
xt = plt.xticks()
n = xt[0]
s = ['%.1f' % abs(i) for i in n]
plt.xticks(n, s)
plt.legend(loc='best')
plt.axvline(0.0)
fig.show();
Correlation among the three aCL indexes in the special lab results (Figure 5).
# transform by log10(x+1); raw correlations could be misleading
# because of many zero values
# blue = no thrombosis; green/magenta/black = thrombosis levels 1-3
temp = slab_df[['aCL IgG', 'aCL IgM', 'aCL IgA']].copy()
temp = temp.apply(lambda x: np.log10(x+1))
d_colors = {0: "blue", 1: "green", 2: "m", 3: "k"}
colors = [d_colors[x] for x in slab_df['Thrombosis']]
axl = pd.plotting.scatter_matrix(temp, color=colors)
Parallel coordinates for Thrombosis levels.
# missing values load as float NaN, so a non-float cell means the index was recorded
KCT_present = slab_df['KCT'].apply(lambda x: type(x)) != float
RVVT_present = slab_df['RVVT'].apply(lambda x: type(x)) != float
LAC_present = slab_df['LAC'].apply(lambda x: type(x)) != float
print("There are %s records with all three indexes present." %
sum(KCT_present & RVVT_present & LAC_present))
temp_df = slab_df[['Thrombosis', 'KCT', 'RVVT', 'LAC']
][KCT_present & RVVT_present & LAC_present].copy()
fig = plt.figure(6)
pd.plotting.parallel_coordinates(
temp_df, 'Thrombosis', sort_labels=True, colormap='viridis')
fig.show();
There are 145 records with all three indexes present.
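The type-based presence check above works because missing cells load as float `NaN`, but pandas' `notna()` expresses the same intent more directly. A sketch on a toy stand-in for `slab_df` (the '+'/'-' values are an assumption for illustration):

```python
import numpy as np
import pandas as pd

# toy stand-in for the three coagulation index columns
toy = pd.DataFrame({'KCT':  ['+', np.nan, '-'],
                    'RVVT': ['-', '-', np.nan],
                    'LAC':  ['+', '+', '+']})
# rows where all three coagulation indexes were recorded
complete = toy[['KCT', 'RVVT', 'LAC']].notna().all(axis=1)
n_complete = int(complete.sum())
```

On the real data this boolean mask plays the same role as `KCT_present & RVVT_present & LAC_present`.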
The overlap in the parallel coordinates plot (Figure 6) indicates that the combination of the three indexes is not a good indicator of the level of Thrombosis.
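That overlap can also be quantified with a contingency table: if the same KCT/RVVT/LAC combination appears under several thrombosis levels, the combination cannot separate them. A sketch with `pd.crosstab` on toy data (the '+'/'-' coding is an assumption):

```python
import pandas as pd

# toy stand-in for the records with all three indexes present
toy = pd.DataFrame({
    'Thrombosis': [0, 0, 1, 1, 2],
    'KCT':  ['-', '+', '+', '-', '+'],
    'RVVT': ['-', '-', '+', '-', '+'],
    'LAC':  ['-', '+', '+', '-', '+'],
})
combo = toy['KCT'] + toy['RVVT'] + toy['LAC']  # e.g. '+-+'
table = pd.crosstab(combo, toy['Thrombosis'])
```

Here the combination `'---'` occurs under both level 0 and level 1, the same ambiguity the parallel coordinates plot shows visually.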
4.1.2 General lab information
fig = plt.figure(7,figsize=(12, 10))
corr = clab_df.corr(method='spearman')
sns.heatmap(corr, annot=True, annot_kws={"size": 28 / np.sqrt(len(corr))})
fig.show();
There are some correlations in the data (Figure 7). For example, APTT and PT have a strong negative correlation, and FG and APTT have a correlation of 0.53. If we want to do further analysis on the numerical values of those measurements, these correlations need to be considered.
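Rather than eyeballing the heatmap, the strongly correlated pairs can be listed programmatically. A sketch on a toy frame; the `0.5` threshold and the toy column values are assumptions, and on the real data `clab_df` would be passed instead:

```python
import pandas as pd

def strong_pairs(df, threshold=0.5):
    """List column pairs whose Spearman correlation has |r| >= threshold."""
    corr = df.corr(method='spearman')
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# toy data: APTT rises as PT falls, GLU only weakly related
toy = pd.DataFrame({'APTT': [30, 35, 40, 45],
                    'PT':   [14, 13, 12, 11],
                    'GLU':  [90, 140, 100, 120]})
pairs = strong_pairs(toy)
```

The resulting list is a convenient input for deciding which measurements to drop or combine before modeling.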
4.2 Joined dataset
## plot age and Thrombosis level
temp = join_df_2.pivot(columns='Thrombosis', values='age_of_sexam')
ax = temp.plot.hist(bins=20)
Different levels of Thrombosis have different age distributions (Figure 8). Regardless of the number of samples, severe thrombosis occurs mostly in the 30s, while level-2 thrombosis occurs at younger ages.
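The per-level age pattern can also be summarized numerically instead of read off the histogram. A minimal sketch, assuming the column names of the joined frame (`Thrombosis`, `age_of_sexam`); the toy values are illustrative only:

```python
import pandas as pd

def age_summary(df, level_col='Thrombosis', age_col='age_of_sexam'):
    """Sample size and median age per thrombosis level."""
    grouped = df.groupby(level_col)[age_col]
    return pd.DataFrame({'n': grouped.size(),
                         'median_age': grouped.median()})

# toy stand-in for join_df_2
toy = pd.DataFrame({'Thrombosis':   [0, 0, 1, 1, 2],
                    'age_of_sexam': [45, 50, 30, 34, 22]})
summary = age_summary(toy)
```

Reporting medians alongside counts guards against over-reading a histogram bar that is tall only because a level has many samples.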
temp_df = join_df[['Thrombosis', 'is_GOT_normal', 'is_GPT_normal',
'is_LDH_normal', 'is_ALP_normal', 'is_TP_normal', 'is_ALB_normal',
'is_UN_normal', 'is_CRE_normal', 'is_TBIL_normal', 'is_TCHO_normal',
'is_TG_normal', 'is_CPK_normal', 'is_GLU_normal', 'is_WBC_normal',
'is_RBC_normal', 'is_HGB_normal', 'is_HCT_normal', 'is_PLT_normal',
'is_PT_normal', 'is_APTT_normal', 'is_FG_normal', 'is_PIC_normal',
'is_TAT_normal', 'is_UPRO_normal', 'is_IGG_normal', 'is_IGA_normal',
'is_IGM_normal', 'is_RF_normal', 'is_C3_normal', 'is_C4_normal',
'is_DNA_normal', 'is_DNAII_normal','is_UA_normal']
]
temp_df = temp_df.dropna()  # drop records missing any of these indicators
temp_df = temp_df.reset_index(drop=True)
fig = plt.figure(9)
pd.plotting.parallel_coordinates(
temp_df, 'Thrombosis', sort_labels=True, colormap='viridis')
plt.xticks(rotation=45)
fig.show();
5 Final plots
5.1 Plot-01: Thrombosis VS Age
plt.style.use("anly503")
## plot age and Thrombosis level
join_df_2['Thrombosis'] = join_df_2['Thrombosis'].astype(float).astype(int)
temp = join_df_2.pivot(columns='Thrombosis', values='age_of_sexam')
ax = temp.plot.hist(bins=20,edgecolor = "white")
ax.figure.set_figwidth(10)
ax.figure.set_figheight(7)
plt.title("Thrombosis VS Age")
plt.xlabel("Age")
plt.legend(title="Thrombosis level")
text =plt.text(s='Note: 0 is no thrombosis',y=-220,x=-13,fontsize=14)
plt.savefig("../plot/plot-01.png");
5.2 Plot-02: Thrombosis diagnosis keywords
from matplotlib.ticker import FuncFormatter
fig = plt.figure(11)
fig.set_figwidth(13)
fig.set_figheight(7)
hN = plt.barh(key_ls, [negative_dict[k] /
nb_negative for k in key_ls], label='negative', color='g')
hS = plt.barh(key_ls, [-positive_dict[k] /
nb_positive for k in key_ls], label='positive')
plt.xlim([-1, 1])
xt = plt.xticks()
n = xt[0]
s = ['{:,.0%}'.format(abs(i)) for i in n]
plt.xticks(n, s)
plt.axvline(0.0)
plt.title("Thrombosis diagnosis keywords")
plt.xlabel("Percentage")
plt.ylabel("Diagnosis")
plt.legend(title="Thrombosis diagnosis", loc='best')
plt.savefig("../plot/plot-02.png");
5.3 Plot-03: General lab correlation
fig = plt.figure(12, figsize=(14, 12))
temp = clab_df.loc[:, clab_df.columns != 'ID']
corr = temp.corr(method='spearman')
sns.heatmap(corr, annot=True, annot_kws={"size": 28 / np.sqrt(len(corr))})
plt.savefig("../plot/plot-03.png");
5.4 Plot-04: General lab examination VS Thrombosis
from matplotlib.patches import Rectangle
temp_df['Thrombosis'] = temp_df['Thrombosis'].astype(float).astype(int)
temp_df = temp_df.reset_index(drop=True)
temp_df1 = temp_df[temp_df['Thrombosis'].isin([1, 2, 3])]
temp_df1 = temp_df1.reset_index(drop=True)
plt.figure(13,figsize=(18, 10))
pd.plotting.parallel_coordinates(
temp_df1.sort_values(by='Thrombosis'), 'Thrombosis', sort_labels=True,color=( '#FF6B6B','#556270', '#4ECDC4'))
plt.xticks(rotation=45)
plt.title("General lab examination VS Thrombosis", pad=20, fontsize=22)
plt.xlabel("Examination index", fontsize=18)
plt.ylabel("Results",fontsize = 18)
plt.legend(title="Thrombosis level",bbox_to_anchor=(1, 0.8))
plt.savefig("../plot/plot-04.png", bbox_inches = 'tight');